Planning for Markov Decision Processes with Sparse Stochasticity
Abstract
Planning algorithms designed for deterministic worlds, such as A* search, usually run much faster than algorithms designed for worlds with uncertain action outcomes, such as value iteration. Real-world planning problems often exhibit uncertainty, which forces us to use the slower algorithms to solve them. Many real-world planning problems, however, exhibit sparse uncertainty: there are long sequences of deterministic actions which accomplish tasks like moving sensor platforms into place, interspersed with a small number of sensing actions which have uncertain outcomes. In this paper we describe a new planning algorithm, called MCP (short for MDP Compression Planning), which combines A* search with value iteration to solve Stochastic Shortest Path problems in MDPs with sparse stochasticity. We present experiments which show that MCP can run substantially faster than competing planners in domains with sparse uncertainty; these experiments are based on a simulation of a ground robot cooperating with a helicopter to fill in a partial map and move to a goal location.

In deterministic planning problems, optimal paths are acyclic: no state is visited more than once. Because of this property, algorithms like A* search can guarantee that they visit each state in the state space no more than once. By visiting the states in an appropriate order, it is possible to ensure that we know the exact value of all of a state’s possible successors before we visit that state; so, the first time we visit a state we can compute its correct value.

By contrast, if actions have uncertain outcomes, optimal paths may contain cycles: some states will be visited two or more times with positive probability. Because of these cycles, there is no way to order states so that we determine the values of a state’s successors before we visit the state itself. Instead, the only way to compute state values is to solve a set of simultaneous equations.
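The contrast between the two regimes can be made concrete with a minimal sketch. It is not the paper's implementation: `one_pass_costs` is a Dijkstra-style backward pass standing in for an ordered deterministic search such as A*, `value_iteration` is a plain Gauss-Seidel sweep, and the two toy graphs and all function names are assumptions made for illustration.

```python
import heapq


def one_pass_costs(reverse_edges, goal):
    """Cost-to-go in a deterministic graph with non-negative edge costs.

    reverse_edges[s] lists (predecessor, cost) pairs for edges predecessor -> s.
    States are finalized in order of increasing cost-to-go, so each state is
    expanded once and its value is exact at that moment.
    """
    value = {goal: 0.0}
    finalized = set()
    queue = [(0.0, goal)]
    while queue:
        v, s = heapq.heappop(queue)
        if s in finalized:
            continue  # stale queue entry
        finalized.add(s)
        for pred, cost in reverse_edges.get(s, []):
            if v + cost < value.get(pred, float("inf")):
                value[pred] = v + cost
                heapq.heappush(queue, (v + cost, pred))
    return value


def value_iteration(actions, goal, tol=1e-10, max_sweeps=10_000):
    """Cost-to-go when outcomes are uncertain.

    actions[s] lists (cost, [(probability, next_state), ...]) tuples.
    Returns the values and the number of sweeps that were needed.
    """
    value = {s: 0.0 for s in actions}
    value[goal] = 0.0
    for sweep in range(1, max_sweeps + 1):
        residual = 0.0
        for s, acts in actions.items():
            best = min(c + sum(p * value[t] for p, t in outs) for c, outs in acts)
            residual = max(residual, abs(best - value[s]))
            value[s] = best
        if residual < tol:
            return value, sweep
    return value, max_sweeps


# Hypothetical deterministic chain a -> b -> goal with unit costs.
chain = {"goal": [("b", 1.0)], "b": [("a", 1.0)]}
det_value = one_pass_costs(chain, "goal")

# Hypothetical stochastic action: from "a", pay 1, then reach the goal or fall
# back to "a" with equal probability. The self-loop gives the equation
# V(a) = 1 + 0.5 * V(a), whose solution is V(a) = 2.
loop = {"a": [(1.0, [(0.5, "goal"), (0.5, "a")])]}
stoch_value, sweeps = value_iteration(loop, "goal")
```

In the chain, every state receives its exact value the first time it is expanded. In the looping example, the value of `a` depends on itself, so the sweep has to be repeated until the residual falls below `tol`.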
In problems with sparse stochasticity, only a small fraction of all states have uncertain outcomes. It is these few states that cause all of the cycles: while a deterministic state s may participate in a cycle, the only way it can do so is if one of its successors has an action with a stochastic outcome (and only if this stochastic action can lead to a predecessor of s). In such problems, we would like to build a smaller MDP which contains only states which are related to stochastic actions. We will call such an MDP a compressed MDP, and we will call its states distinguished states. We could then run fast algorithms like A* search to plan paths between distinguished states, and reserve slower algorithms like value iteration for deciding how to deal with stochastic outcomes.

[Figure 1: Robot-Helicopter Coordination. Panels: (a) Segbot, (b) Robotic helicopter, (c) 3D map, (d) Planning map, (e) Execution simulation.]

There are two problems with such a strategy. First, there can be a large number of states which are related to stochastic actions, and so it may be impractical to enumerate all of them and make them all distinguished states; we would prefer instead to distinguish only states which are likely to be encountered while executing some policy which we are considering. Second, there can be a large number of ways to get from one distinguished state to another: edges in the compressed MDP correspond to sequences of actions in the original MDP. If we knew the values of all of the distinguished states exactly, then we could use A* search to generate optimal paths between them, but since we do not, we cannot.

In this paper, we will describe an algorithm which incrementally builds a compressed MDP using a sequence of deterministic searches. It adds states and edges to the compressed MDP only by encountering them along trajectories; so, it never adds irrelevant states or edges to the compressed MDP.
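To illustrate what a compressed MDP looks like, here is a minimal sketch of the naive construction, which is not the incremental MCP algorithm. It enumerates every distinguished state up front (the step the text above notes may be impractical in large problems) and connects distinguished states by deterministic shortest-path costs. Dijkstra's algorithm stands in for A* with a zero heuristic. The toy map, the state names, and all function names are assumptions.

```python
import heapq


def shortest_costs(det_edges, source):
    """Dijkstra over deterministic edges; det_edges[s] lists (next_state, cost)."""
    dist = {source: 0.0}
    queue = [(0.0, source)]
    while queue:
        d, s = heapq.heappop(queue)
        if d > dist[s]:
            continue  # stale queue entry
        for t, c in det_edges.get(s, []):
            if d + c < dist.get(t, float("inf")):
                dist[t] = d + c
                heapq.heappush(queue, (d + c, t))
    return dist


def build_compressed_mdp(det_edges, stoch_actions, start, goal):
    """Keep only distinguished states: start, goal, states that own a stochastic
    action, and the possible outcomes of those actions. Each compressed edge
    summarizes a whole deterministic action sequence by its total cost."""
    distinguished = {start, goal} | set(stoch_actions)
    for acts in stoch_actions.values():
        for _cost, outcomes in acts:
            distinguished |= {t for _p, t in outcomes}
    compressed = {}
    for s in distinguished - {goal}:
        acts = list(stoch_actions.get(s, []))
        dist = shortest_costs(det_edges, s)
        for t in distinguished:
            if t != s and t in dist:
                acts.append((dist[t], [(1.0, t)]))
        compressed[s] = acts
    return compressed


def solve(compressed, goal, tol=1e-10, max_sweeps=10_000):
    """Value iteration restricted to the distinguished states.
    Assumes every non-goal distinguished state has at least one action."""
    value = {s: 0.0 for s in compressed}
    value[goal] = 0.0
    for _ in range(max_sweeps):
        residual = 0.0
        for s, acts in compressed.items():
            best = min(c + sum(p * value[t] for p, t in outs) for c, outs in acts)
            residual = max(residual, abs(best - value[s]))
            value[s] = best
        if residual < tol:
            break
    return value


# Hypothetical map: a short route through an uncertain doorway at "b",
# and a long deterministic detour through "x".
det_edges = {
    "start": [("a", 1.0), ("x", 4.0)],
    "a": [("b", 1.0)],
    "c": [("goal", 1.0)],
    "blocked": [("x", 5.0)],
    "x": [("goal", 4.0)],
}
# Trying the doorway costs 1; it is open with probability 0.7.
stoch_actions = {"b": [(1.0, [(0.7, "c"), (0.3, "blocked")])]}

compressed = build_compressed_mdp(det_edges, stoch_actions, "start", "goal")
values = solve(compressed, "goal")
```

The undistinguished states `a` and `x` never appear in `compressed`; they are only touched by the deterministic search. In this toy map, trying the doorway from `start` has expected cost 2 + 1 + 0.7 · 1 + 0.3 · 9 = 6.4, which is less than the detour cost of 8.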
Trajectories are generated by deterministic search, and so undistinguished states are treated only with fast algorithms. Bellman errors in the values for distinguished states show us where to try additional trajectories, and help us build the relevant parts of the compressed MDP as quickly as possible.

1 Robot-Helicopter Coordination Problem

The motivation for our research was the problem of coordinating a ground robot and a helicopter. The ground robot needs to plan a path from its current location to a goal, but has only partial knowledge of the surrounding terrain. The helicopter can aid the ground robot by flying to and sensing places in the map. Figure 1(a) shows our ground robot, a converted Segway with a SICK laser rangefinder. Figure 1(b) shows the helicopter, also with a SICK. Figure 1(c) shows a 3D map of the environment in which the robot operates. The 3D map is post-processed to produce a discretized 2D environment (Figure 1(d)). Several places in the map are unknown, either because the robot has not visited them or because their status may have changed (e.g., a car may occupy a driveway). Such places are shown in Figure 1(d) as white squares. The elevation of each white square is proportional to the probability that there is an obstacle there; we assume independence between unknown squares.

The robot must take the unknown locations into account when planning its route. It may plan a path through these locations, but it risks having to turn back if its way is blocked. Alternatively, the robot can ask the helicopter to fly to any of these places and sense them. We assign a cost to running the robot, and a somewhat higher cost to running the helicopter. The planning task is to minimize the expected overall cost of running the robot and the helicopter while getting the robot to its destination and the helicopter back to its home base. Figure 1(e) shows a snapshot of the robot and helicopter executing a policy.
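The trade-off described above comes down to comparing expected costs. The following is a small worked sketch for a single unknown square, assuming a perfect sensor and made-up cost figures; every parameter name and number is hypothetical and none comes from the paper's experiments.

```python
def expected_costs(p_blocked, to_cell, beyond_cell, back_and_around, detour, sense):
    """Expected total cost of three simple strategies for one unknown square.

    p_blocked        probability that the square holds an obstacle
    to_cell          robot cost to drive up to the unknown square
    beyond_cell      robot cost from the square to the goal if it is free
    back_and_around  robot cost to turn back and reach the goal another way
    detour           robot cost of avoiding the square from the start
    sense            helicopter cost to fly out, sense, and return home
    """
    risk_it = to_cell + (1.0 - p_blocked) * beyond_cell + p_blocked * back_and_around
    sense_first = (
        sense
        + (1.0 - p_blocked) * (to_cell + beyond_cell)
        + p_blocked * detour
    )
    return {"risk_it": risk_it, "detour": detour, "sense_first": sense_first}


def best_option(p_blocked, **costs):
    """Name of the strategy with the lowest expected cost."""
    options = expected_costs(p_blocked, **costs)
    return min(options, key=options.get)


# Hypothetical cost figures, in arbitrary units.
costs = dict(to_cell=2.0, beyond_cell=2.0, back_and_around=10.0, detour=8.0, sense=1.5)
for p in (0.3, 0.6, 0.95):
    print(p, best_option(p, **costs))
```

With these figures, driving straight through is cheapest when the square is probably free, sensing first pays off at intermediate probabilities, and the detour wins when the square is almost certainly blocked. The full planning problem has to make this kind of comparison jointly over all unknown squares.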
Designing a good policy for the robot and helicopter is a POMDP planning problem; unfortunately, POMDPs are in general difficult to solve (PSPACE-complete [7]). In the POMDP representation, a state is the position of the robot, the current location of the helicopter (a point on a line segment from one of the unknown places to another unknown place or the home base), and the true status of each unknown location. The positions of the robot and the helicopter are observable, so that the only hidden variables are whether each unknown place is occupied. The number of states is $(\#\text{ of robot locations}) \times (\#\text{ of helicopter locations}) \times 2^{\#\text{ of unknown places}}$. So, the number of states is exponential in the number of unknown places and therefore quickly becomes very large.

We approach the problem by planning in the belief state space, that is, the space of probability distributions over world states. This problem is a continuous-state MDP; in this belief MDP, our state consists of the ground robot’s location, the helicopter’s location, and a probability of occupancy for each unknown location. We will discretize the continuous probability variables by breaking the interval [0, 1] into several chunks; so, the number of belief states is exponential in the number of unknown places, and classical algorithms such as value iteration are infeasible even on small problems.

If sensors are perfect, this domain is acyclic: after we sense a square we know its true state forever after. On the other hand, imperfect sensors can lead to cycles: new sensor data can contradict older sensor data and lead to increased uncertainty. With or without sensor noise, our belief state MDP differs from general MDPs because its stochastic transitions are sparse: large portions of the policy (while the robot and helicopter are traveling between unknown locations) are deterministic. The algorithm we propose in this paper takes advantage of this property of the problem, as we explain in the next section.
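Both the state counts and the cyclic behaviour of noisy sensing can be checked with a short sketch. The counting functions follow the product formula in the text. The number of bins `n_bins`, the sensor model parameters `p_hit` and `p_false_alarm`, and the example figures are assumptions made for illustration, not values from the paper.

```python
def pomdp_state_count(n_robot, n_heli, n_unknown):
    """Robot locations x helicopter locations x 2^(unknown places)."""
    return n_robot * n_heli * 2 ** n_unknown


def belief_state_count(n_robot, n_heli, n_unknown, n_bins):
    """Same product, with each occupancy probability discretized into n_bins chunks."""
    return n_robot * n_heli * n_bins ** n_unknown


def bayes_update(prior, saw_obstacle, p_hit, p_false_alarm):
    """Posterior probability that a square is occupied after one reading.

    p_hit          P(sensor reports obstacle | square occupied)
    p_false_alarm  P(sensor reports obstacle | square free)
    """
    like_occupied = p_hit if saw_obstacle else 1.0 - p_hit
    like_free = p_false_alarm if saw_obstacle else 1.0 - p_false_alarm
    evidence = like_occupied * prior + like_free * (1.0 - prior)
    if evidence == 0.0:
        raise ValueError("observation has zero probability under this belief")
    return like_occupied * prior / evidence


# Hypothetical sizes: 100 robot cells, 10 helicopter positions, 5 unknown squares.
n_world = pomdp_state_count(100, 10, 5)         # 100 * 10 * 2**5 = 32000
n_belief = belief_state_count(100, 10, 5, 4)    # 100 * 10 * 4**5 = 1024000

# Perfect sensor: one reading settles the question, and the belief stays put.
certain = bayes_update(0.5, True, p_hit=1.0, p_false_alarm=0.0)
still_certain = bayes_update(certain, True, p_hit=1.0, p_false_alarm=0.0)

# Noisy sensor: a contradicting second reading pushes the belief back toward 0.5,
# so the belief MDP can revisit a state it has already left.
after_first = bayes_update(0.5, True, p_hit=0.8, p_false_alarm=0.2)
after_second = bayes_update(after_first, False, p_hit=0.8, p_false_alarm=0.2)
```

With the symmetric noise model used here, two contradictory readings cancel and the belief returns to its starting value, which is one concrete way a cycle can arise in belief space.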
Similar resources
Accelerated decomposition techniques for large discounted Markov decision processes
Many hierarchical techniques to solve large Markov decision processes (MDPs) are based on the partition of the state space into strongly connected components (SCCs) that can be classified into some levels. In each level, smaller problems named restricted MDPs are solved, and then these partial solutions are combined to obtain the global solution. In this paper, we first propose a novel algorith...
Approximate Policy Iteration with a Policy Language Bias: Solving Relational Markov Decision Processes
We study an approach to policy selection for large relational Markov Decision Processes (MDPs). We consider a variant of approximate policy iteration (API) that replaces the usual value-function learning step with a learning step in policy space. This is advantageous in domains where good policies are easier to represent and learn than the corresponding value functions, which is often the case ...
Adaptive Planning for Markov Decision Processes with Uncertain Transition Models via Incremental Feature Dependency Discovery
Solving large scale sequential decision making problems without prior knowledge of the state transition model is a key problem in the planning literature. One approach to tackle this problem is to learn the state transition model online using limited observed measurements. We present an adaptive function approximator (incremental Feature Dependency Discovery (iFDD)) that grows the set of featur...
Approximate planning for decentralized MDPs with sparse interactions
We explore how local interactions can simplify the process of decision-making in multiagent systems. We review the decentralized sparse-interaction Markov decision process [3], which explicitly distinguishes the situations in which the agents in the team must coordinate from those in which they can act independently. We situate this class of problems within different multiagent models, such as MMDPs ...
Teamwork and Coordination under Model Uncertainty in DEC-POMDPs
Distributed Partially Observable Markov Decision Processes (DEC-POMDPs) are a popular planning framework for multiagent teamwork to compute (near-)optimal plans. However, these methods assume a complete and correct world model, which is often violated in real-world domains. We provide a new algorithm for DEC-POMDPs that is more robust to model uncertainty, with a focus on domains with sparse ag...
Speech enhancement based on hidden Markov model using sparse code shrinkage
This paper presents a new hidden Markov model-based (HMM-based) speech enhancement framework based on independent component analysis (ICA). We propose analytical procedures for training clean speech and noise models by the Baum re-estimation algorithm and present a maximum a posteriori (MAP) estimator based on Laplace-Gaussian (for clean speech and noise respectively) combination in the HMM ...